The VCF file can be compressed with “bgzip”, the block-compression program distributed with HTSlib, and the compressed file can then be indexed with the tabix program, which is a tool for indexing large, position-sorted, tab-delimited bioinformatics text files. The tabix program can be installed on Debian-based Linux distributions (such as Ubuntu) using:
sudo apt install tabix
The following commands compress the VCF file and index it:
bgzip -c SRR769545.vcf > SRR769545.vcf.gz
tabix -p vcf SRR769545.vcf.gz
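As a quick sanity check on the two commands above, the following sketch counts the variant records in the compressed file, confirms that the index file exists, and retrieves a few records from one region. The helper name count_variants and the region chr1:1-1000000 are illustrative assumptions and not part of the original workflow; the contig name must match the names used in the VCF header.

```shell
# Hypothetical helper: count variant records (non-header lines) in a compressed VCF.
# bgzip output is gzip-compatible, so plain gzip can decompress it.
count_variants() {
    gzip -dc "$1" | grep -vc '^#'
}

vcf=SRR769545.vcf.gz

# Run the checks only if the compressed VCF is present in the current directory.
if [ -f "$vcf" ]; then
    echo "variant records: $(count_variants "$vcf")"

    # tabix writes its index next to the data file with a .tbi suffix.
    [ -f "$vcf.tbi" ] && echo "index found: $vcf.tbi"

    # Region query through the index; the region below is only an example.
    tabix "$vcf" chr1:1-1000000 | head -n 5
fi
```

If the region query prints nothing, either no variants were called in that interval or the contig is named differently (for example “1” rather than “chr1”).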
The final step in the reference-guided genome assembly is to create a consensus sequence. The reference genome sequence that was used to create the original BAM file is piped to the “bcftools consensus” command, which applies the variants in the indexed VCF file to the reference to create a new genome sequence for the individual studied:
cat ../ref/hg38.fa \
| bcftools consensus SRR769545.vcf.gz \
> SRR769545_genome.fasta
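The same step can also be written without the pipe by passing the reference through the -f option and the output file through -o. The sketch below does this and then compares the number of FASTA records in the reference and in the consensus, which would normally be equal. The helper name count_seqs is an illustrative assumption; the file names are those used above.

```shell
# Hypothetical helper: count FASTA records (header lines start with ">").
count_seqs() {
    grep -c '^>' "$1"
}

ref=../ref/hg38.fa
vcf=SRR769545.vcf.gz
out=SRR769545_genome.fasta

# Run only if the reference and the indexed VCF are available.
if [ -f "$ref" ] && [ -f "$vcf" ]; then
    # Same operation as the pipe-based command, with the reference given via -f.
    bcftools consensus -f "$ref" -o "$out" "$vcf"

    # Compare the number of sequences before and after applying the variants.
    echo "reference sequences: $(count_seqs "$ref")"
    echo "consensus sequences: $(count_seqs "$out")"
fi
```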
The “SRR769545_genome.fasta” file is the FASTA genome sequence that incorporates the variants genotyped for the individual whose whole genome was sequenced. If a region of the reference genome is not covered by any reads, no variants can be called there, so the consensus retains the reference bases for that region and any real variants in it are missed; therefore, high sequence coverage is required. Instead of incorporating the uncovered regions of the reference genome into the new sequence, the bases of those regions can be masked with Ns. There are several more advanced programs for reference-guided genome assembly, such as RATT [19].
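One possible way to mask the uncovered regions is sketched below. It assumes that bedtools is installed and that the sorted BAM file from the mapping step is named SRR769545.bam; both the BAM file name and the helper name zero_coverage_bed are assumptions made for illustration. The zero-coverage intervals are written to a BED file, which is then passed to “bcftools consensus” with the -m option so that those regions are replaced by Ns.

```shell
# Hypothetical helper: keep only zero-coverage intervals from bedGraph input
# (columns: chrom, start, end, depth) and print them as 3-column BED.
zero_coverage_bed() {
    awk 'BEGIN { OFS = "\t" } $4 == 0 { print $1, $2, $3 }'
}

bam=SRR769545.bam    # assumed name of the position-sorted BAM file
ref=../ref/hg38.fa
vcf=SRR769545.vcf.gz

# Run only if all three input files are available.
if [ -f "$bam" ] && [ -f "$ref" ] && [ -f "$vcf" ]; then
    # -bga reports the depth of every interval, including those with zero depth.
    bedtools genomecov -ibam "$bam" -bga | zero_coverage_bed > uncovered.bed

    # -m replaces the regions listed in the BED file with N (the default mask character).
    bcftools consensus -f "$ref" -m uncovered.bed "$vcf" > SRR769545_genome_masked.fasta
fi
```

The depth threshold can be raised in the awk condition (for example $4 < 5) if low-coverage regions should be masked as well.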
A different approach is to reconstruct the genome of an organism from scratch, without using a reference genome. This approach is called de novo assembly and is discussed in Chapter 3.
2.6 SUMMARY
Except for sequencing applications that use de novo genome assembly, read mapping to a reference genome is the most fundamental step in the workflow of sequencing data analysis. In NGS or TGS, the DNA molecules are fragmented in the library preparation step, and the DNA libraries are then sequenced to produce the raw data in the form of millions of reads of specific lengths. Depending on the technology used, a high-throughput sequencing instrument produces either short reads (50–400 bp) or long reads (>400 bp). The quality control step, which assesses and preprocesses the raw reads, is essential to reduce base-calling errors and the biases that may arise from the presence of technical reads. Read mapping is the most computationally expensive step in the sequencing data analysis workflow because the alignment program attempts to determine the points of origin of millions or billions of reads in a reference genome. The alignment requires even more effort for RNA-Seq read